System and method for ultrasound color doppler imaging

ABSTRACT

Systems and methods for color Doppler imaging in an ultrasound imaging system are disclosed herein. An ultrasound imaging system includes color Doppler imaging circuitry. The color Doppler imaging circuitry is configured to estimate flow parameters. The imaging circuitry includes a radio frequency (“RF”) demodulator configured to produce in-phase and quadrature components of an ultra-sound data vector. The RF demodulator includes a table in memory that stores interleaved sine and cosine values. The RF demodulator maintains an index value for the table having higher precision than is used to index the table. The RF demodulator rounds the index value for each access of the table. Each table access retrieves a sine value and a cosine value.

The present application claims priority to and incorporates by referenceprovisional patent application 61/077,917, filed on Jul. 3, 2008,entitled “Efficient Implementation of Ultrasound Color DopplerAlgorithms Using Partial Sums,” provisional patent application61/087,019, filed on Aug. 7, 2008, entitled “Efficient Implementation ofUltrasound Color Doppler Algorithms,” and provisional patent application61/143,676, filed on Jan. 9, 2009, entitled “Efficient Implementation ofUltrasound Color Doppler Algorithms on VLIW Architectures.”

BACKGROUND

Ultrasound systems are becoming increasingly important tools fornon-invasive medical diagnostics. Consequently, ultrasound imagingsystems featuring reduced power consumption and improved portability aredesirable. Digital signal processors (“DSPs”) have been used to reducepower and increase portability of various types of medical imagingsystems, including ultrasound systems.

The most common operating modes of ultrasound systems are Brightnessmode (“B-mode”) and Color Doppler Mode. In B-mode, an ultrasound systemtransmits sound waves of a particular frequency into the body ofinterest and records the received echoes as a function of time andposition to image structural information. The brightness of each pointon a display represents the amplitude of the sampled echoes. ColorDoppler mode is similar to B-mode but transmits multiple pulses (anensemble) and uses the relative time (phase) between received echoes toestimate blood flow characteristics. Commonly estimated flow parametersinclude flow strength (power), direction, velocity, and turbulence.

SUMMARY

A system and method for color Doppler imaging in an ultra-sound systemare disclosed herein. In one embodiment, an ultrasound imaging systemincludes color Doppler imaging circuitry. The color Doppler imagingcircuitry is configured to estimate flow parameters. The imagingcircuitry includes a radio frequency (“RF”) demodulator configured toproduce in-phase and quadrature components of an ultrasound data vector.The radio frequency demodulator includes a table in memory that storesinterleaved sine and cosine values. The radio frequency demodulatormaintains an index value for the table having higher precision than isused to index the table. The radio frequency demodulator rounds theindex value for each access of the table. Each table access retrieves asine value and a cosine value.

In another embodiment, a method includes fetching from ultrasoundimaging system memory, in a single memory access cycle, a plurality ofWall filter coefficients. The plurality of Wall filter coefficients areapplied to a plurality of complex ultrasound data values in a singleexecution cycle. Wall filtered ultrasound data is provided to a flowestimator.

In another embodiment, an ultrasound color Doppler imaging systemincludes a processor and a software system for color Doppler imagingexecuted by the processor. The software system includes instructionsthat when executed cause the processor to convert real radio frequency(“RF”) ultrasound samples to complex RF ultrasound samples using sineand cosine values stored in adjacent memory locations.

In yet another embodiment, a color Doppler imaging system includescircuitry that computes a sum of powers of a first ensemble and a sum ofpowers of a second ensemble. The system also includes circuitry thatstores the sum of powers of the first ensemble and the sum of powers ofthe second ensemble in memory. The system further includes circuitrythat retrieves the sum of powers of the first ensemble and the sum ofpowers of the second ensemble from memory and computes a flow powerparameter using the retrieved sums of powers

BRIEF DESCRIPTION OF THE DRAWINGS

For a detailed description of exemplary embodiments of the invention,reference will now be made to the accompanying drawings in which:

FIG. 1 shows a block diagram of a receiver path of an ultra-soundimaging system configured to perform color Doppler imaging in accordancewith various embodiments;

FIG. 2 shows a very long instruction word (“VLIW”) processorarchitecture for implementing color Doppler imaging in accordance withvarious embodiments;

FIG. 3 shows a processing flow diagram for performing radio frequencydemodulation mixing in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments;

FIG. 4 shows a processing flow diagram for performing radio frequencydemodulation decimation in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments;

FIG. 5 shows a processing flow diagram for implementing a Wall filter ina color Doppler imaging system using a VLIW processor in accordance withvarious embodiments;

FIG. 6 shows a processing flow diagram for performing flow powerestimation in a color Doppler imaging system using a VLIW processor inaccordance with various embodiments;

FIG. 7 shows a processing flow diagram for performing correlation andpower computations in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments.

FIG. 8 shows an exemplary application of the direct approach to flowpower processing; and

FIG. 9 shows an exemplary application of a partial sums based approachto flow power processing in accordance with various embodiments.

NOTATION AND NOMENCLATURE

Certain terms are used throughout the following description and claimsto refer to particular system components. As one skilled in the art willappreciate, companies may refer to a component by different names. Thisdocument does not intend to distinguish between components that differin name but not function. In the following discussion and in the claims,the terms “including” and “comprising” are used in an open-endedfashion, and thus should be interpreted to mean “including, but notlimited to . . . ” Also, the term “couple” or “couples” is intended tomean either an indirect or direct electrical connection. Thus, if afirst device couples to a second device, that connection may be througha direct electrical connection, or through an indirect electricalconnection via other devices and connections. Further, the term“software” includes any executable code capable of running on aprocessor, regardless of the media used to store the software. Thus,code stored in memory (e.g., non-volatile memory), and sometimesreferred to as “embedded firmware,” is included within the definition ofsoftware. The terms “ultrasound” or “ultrasonic” generally refer toacoustic waves at frequencies beyond the range of human hearing (e.g.,frequencies above 20 KHz).

DETAILED DESCRIPTION

The following discussion is directed to various embodiments of theinvention. Although one or more of these embodiments may be preferred,the embodiments disclosed should not be interpreted, or otherwise used,as limiting the scope of the disclosure, including the claims. Inaddition, one skilled in the art will understand that the followingdescription has broad application, and the discussion of any embodimentis meant only to be exemplary of that embodiment, and not intended tointimate that the scope of the disclosure, including the claims, islimited to that embodiment.

Color Doppler mode ultrasound involves estimating blood flowcharacteristics using Doppler techniques, and is employed to diagnose avariety of conditions, for example blood clots, valve defects andblocked arteries. The processing demands make color Doppler mode one ofthe more computationally intensive applications of ultrasound. The highcomputational demands of color Doppler mode have required hardwareimplementations. Unfortunately, hardware implementations can berestrictive and expensive. Embodiments of the present disclosure exploitnovel processing algorithms that allow implementation of color Dopplermode ultra-sound in hardware or via software executed on a processor,for example, a very long instruction word (“VLIW”) processor, such as aC64X+™ processor produced by Texas Instruments Incorporated.

FIG. 1 shows a block diagram of a receive path 100 of an ultra-soundimaging system configured to perform color Doppler imaging in accordancewith various embodiments. Ultrasound receivers use amplitude and phaseinformation from reflected sound waves to provide structural andfunctional information about a body of interest. The receive path 100includes a transducer 102, receiver frontend/beamformer circuitry 104, abrightness mode (“B-mode”) imaging system 106, a color Doppler modeimaging system 108, and an image processor/display 110. The transducer102 detects ultrasonic waves reflected by internal structures of thesubject and converts the detected sound waves (i.e., pressure waves)into electrical signals. The transducer 102 can comprise a piezoelectriccrystal, electromagnetic transducer, micro-electro-mechanical system(“MEMS”) transducer or other device that converts sound waves into anelectrical signal. Moreover, the transducer 102 can comprise one or moretransducer elements. In some embodiments, the transducer 102 can convertelectrical drive signals generated by a transmitter into sound wavesthat are introduced into the subject to be imaged, for example, a humanbody when considering medical ultrasound. In some embodiments, the sametransducer elements are used to generate ultrasonic waves and to detectultrasonic waves. In other embodiments, separate transducer elements areused for wave generation and detection.

The receiver frontend/beamformer 104 receives the sound signals detectedby the transducer 102. The receiver frontend/beamformer 104 amplifies,filters, and digitizes the received signals. The digitized signals arescaled and delayed to permit coherent summation. The beamformed signalis provided to the B-mode imaging system 106 and/or the color Dopplermode imaging system 108.

The color Doppler mode imaging system 108 includes a radio frequency(“RF”) demodulator 112, a Wall filter 114, and flow estimators 116. TheRF demodulator 112 mixes, filters, and decimates echo data. RFdemodulation is used for B-mode as well as color Doppler modeprocessing. In color Doppler mode, the input to the RF demodulator 112is multiple ensembles of scan lines. There may be, for example, B scanline sets where each set consists of N scan lines. Thus, the ensemblesize is N. Each scan line consists of T real valued RF data samples. TheRF demodulator 112 produces decimated in-phase (I) and quadrature (Q)components for each of the scan lines, where the decimation factor is Sand the decimation filter includes L taps. Thus, the RF demodulatorproduces an output scan line including D (equal to T/S) decimatedpoints. The Wall filter uses D sets (ensembles) of scan lines, each setconsisting of N decimated points, to generate D flow or power estimates.The purpose of the Wall filter 114 is to process the demodulated echodata and filter out the motion of blood vessel walls. The flowestimators 116 can include a color flow estimator and/or a flow powerestimator. The color flow estimator estimates blood flow velocity,turbulence and power. The flow power estimator estimates only the flowpower and is used to determine the strength of blood flow or thepresence of blood.

Embodiments of the present disclosure employ an RF demodulator, WallFilter, and/or flow estimators using novel processing methods thatimprove the performance of the color Doppler imaging system 108. Morespecifically, embodiments improve the performance of the color Dopplerimaging system 108 when implemented on a processor, such as a VLIWprocessor.

In at least some embodiments of the ultra-sound imaging system receivepath 100, at least some of the operations discussed herein, for example,the RF demodulator 112, the Wall filter 114, and/or the flow estimators116 algorithmic operations described below can be implemented viasoftware programming executed by a processor, such as a DSP or VLIWprocessor. Software programming can be stored in a computer readablemedium, such as a semiconductor memory device, a magnetic or opticalstorage device, etc. accessible to the processor. In some embodiments,at least some of the operations of the RF demodulator 112, the Wallfilter 114, and/or the flow estimators 116 may be implemented indedicated hardware circuitry, for example an application specificintegrated circuit (“ASIC”) or field programmable gate array (“FPGA”).

When attempting to optimize an algorithm for any architecture, it isuseful to create complexity targets for the exercise. Towards this end,a simple “best-case” complexity analysis is presented to establish alower bound for the complexity (DSP cycles) of the subject algorithms onthe Texas Instruments'C64x+™ core. Although embodiments are explainedwith respect to C64x+ cores, the general methodology presented isapplicable to other VLIW architectures.

FIG. 2 shows a VLIW processor architecture (a TI C64x+ architecture) 200for implementing color Doppler imaging. Various elements of the VLIWprocessor architecture 200 such as control logic, cache memories, clockcircuitry, interrupt circuitry, peripherals, interconnecting buses,processor input/output ports, etc. have been omitted for clarity. Duringevery clock cycle in the TI C64x+ platform, the instruction fetch 202,dispatch 204, and decode units 206 deliver instructions to the eightfunctional units that reside in two data paths (A 208A and B 208B). Eachdata path contains four functional units (L 210A/B, S 212A/B, M 214A/B,and D 216A/B) and a register file 218A/B including 32 general-purpose32-bit registers. Generally, the L units 210A/B are configured for32/40-bit arithmetic operations and 32-bit logical operations, the Sunits 212A/B are configured for 32-bit arithmetic and logic operationsand 40-bit shifting, the M units 214A/B are configured to performmultiplication, and the D units 216A/B are configured for various dataload/store operations and 32-bit addition/subtraction and addresscomputation. Unless otherwise noted, every unit (L 210A/B, S 212A/B, M214 A/B, and D 216 A/B) produces a result at each clock cycle when theoperations are pipelined perfectly. TI C64X+ instructions referred toherein include:

-   -   ADD—add two signed integers without saturation.    -   DPACK2—parallel PACK2 and PACKH2 operations.    -   DOTP2—dot product of two pairs of signed, packed 16-bit values.    -   DOTPN2—dot product of two pairs of signed, packed 16-bit values        where the second product is negated.    -   LDW—load word (32-bits) from memory.    -   LDDW—load double word (64-bits) from memory.    -   MPY2IR—perform two 16-bit by 32-bit multiplies, including a        round and a shift by 15 bits to produce 32-bit results.    -   PACK2—move the lower half-word (16-bits) of src1 into the upper        half-word of dest, and the lower half-word of src2 into the        lower half-word of dest.    -   PACKH2—move the upper half-word of src1 into the upper half-word        of dest and the upper half-word of src2 into the lower half-word        of dest.    -   PACKLH2—move the lower half-word of src1 into the upper        half-word of dest and the upper half-word of src2 into the lower        half-word of dest.    -   SADD—add two signed integers, saturate the result.    -   SPACK2—saturate two 32-bit values to signed 16-bit values and        pack them into upper and lower register halves.    -   STDW—store double word (64-bits) to memory.

The complexity of the each algorithm is approximated by the complexityof the loop kernels, which dominate complexity of these algorithms. Thecomplexity analysis is “best-case” in the sense that it does not attemptto estimate overheads and is a first order approximation of thecomplexity. Overheads include epilog and prolog of loop kernels, outerloop operations in nested-loop implementations, etc. In order to achievethese estimates, an implementation may need to change the algorithm datapath (without affecting performance) to get the lowest complexity on theC64x+™ core. To arrive at these estimates, the most efficient data flowand CPU instructions are estimated for each of the algorithmicoperations. Then the various functional units to which theseinstructions can be mapped are considered and the cycles on the unitthat is most loaded is used as the estimate of the algorithm'scomplexity. Note that perfect pipelining performance is assumed.

Throughout this disclosure lower-case letters are used as indexes intoquantities whose maximum values are indicated by their upper-casecounterparts, e.g. t is an index into the T RF points of each scan line.Also upper case letters denote matrices and vectors while their lowercase counterparts denote the individual elements in thesematrices/vectors, e.g. r is an element of vector R. The subscripts usedwith matrices denote their dimensions whereas subscripts used withelements denote their positions in the matrix/vector, e.g. r_(t) denotesthe tth element in the vector R, whereas R_(T) denotes that R is avector of length T.

In some embodiments of the color Doppler imaging system 108, the RFdemodulator 112 takes 32-bit inputs and produces 16-bit decimatedresults. In such embodiments, the input and output bit precision forother modules (e.g., Wall filter 114) may be 16-bit, with the exceptionof the flow estimates produced by the flow estimators 116, which may be8-bits each. All tables and filter coefficients may be 16-bit, unlessotherwise stated. Software embodiments may be implemented using the Cprogramming language and compiler intrinsics (e.g., special functionsthat map to inlined processor instructions), and without use of assemblylanguage. Embodiments have been implemented using normalized mean squareerror and normalized maximum absolute error targets of 1e-6 and 1e-3.The normalization of the metrics was with respect to floating pointoutputs.

Returning now to the RF Demodulator 112, for a given transducer centerfrequency f_(c), and front end sampling frequency f_(s), the RFdemodulator 112 mixes the beamformed RF data received from the frontend104 for each scan line, R_(T), to create in-phase and quadraturecomponents of the vector, M_(T).

$\begin{matrix}{{{{Re}\left\{ m_{t} \right\}} = {r_{t} \cdot {\cos ({kt})}}},{k = \frac{2\pi \; f_{c}}{f_{s}}},{0 \leq t < {T - 1}},} & \left( {1a} \right) \\{{{{Im}\left\{ m_{t} \right\}} = {r_{t} \cdot {\sin ({kt})}}},{k = \frac{2\pi \; f_{c}}{f_{s}}},{0 \leq t < {T - 1.}}} & \left( {1b} \right)\end{matrix}$

These mixed outputs, M_(T), are then passed through a low-pass filter,F, to prevent aliasing and down-sampled in a single step, to generatethe decimated output vector, E_(D).

$\begin{matrix}{{{{Re}\left\{ e_{d} \right\}} = {\sum\limits_{l = 0}^{L}{{f_{l} \cdot {Re}}\left\{ m_{t - l} \right\}}}},{0 \leq d \leq {D - 1}},{t = {Sd}},} & \left( {2a} \right) \\{{{{Im}\left\{ e_{d} \right\}} = {\sum\limits_{l = 0}^{L}{{f_{l} \cdot {Im}}\left\{ m_{t - l} \right\}}}},{0 \leq d \leq {D - 1}},{t = {{Sd}.}}} & \left( {2b} \right)\end{matrix}$

As shown in equations (1a-1b), the mixing operation involvesmultiplication of an RF value r_(t) by a sine value and a cosine value.The sine and cosine values needed for the mixing operation can be storedin tables or computed on the fly. From a complexity perspective, tablesare preferable provided they can be kept reasonably small. One approachis using a table corresponding to sin(2πkt) and cos(2πkt) for allpossible t. Although this approach provides high precision and easylinear indexing, the tables may be large (T 16-bit values) and wouldneed to be regenerated for every change of k, resulting from a change ofscanning parameters.

Embodiments of the present disclosure include a small table (forexample, a 2 kilo-byte table consisting of 512 sine and 512 cosinevalues with each value represented by a 16-bit number, is sufficient insome embodiments) containing a cycle of interleaved sine and cosine data(sin(2πl)/cos(2πl)) and index the table circularly using k. Embodimentsprovide adequate performance using the small table by maintaining atotal indexing value at a high precision (e.g., a 32-bit number) androunding the indexing value to the closest table index only for look-uppurposes. Thus, all table index increments use 32-bits although indexingof the table would use fewer bits (e.g., 9 bits). Using such a table isadvantageous because a single table suffices for a variety of parametersettings—with only the step-size for table indexing, k, changed fordifferent parameter settings. Although it is possible to generate thecosine values from the sine table by appropriate offsets, a larger tablewith interlaced 16-bit sine/cosine values is preferable to simplifyindexing and to allow access to both sine and cosine values for a pointusing a single 32-bit load. To simplify indexing, a complete cycle ofthe sinusoid is preferably stored, rather than storing a quartersinusoid and using more complex indexing. To further simplify tableindexing, the table length is preferably a power of two.

FIG. 3 shows a processing flow diagram 300 for performing mixing in acolor Doppler imaging system using a VLIW processor in accordance withvarious embodiments. For the sake of simplicity, table indexing has beenomitted from FIG. 3. In block 302, RF values r_(t) and r_(t+1) are readfrom memory and loaded into registers. In blocks 304 and 306, sine andcosine values corresponding to the RF values are read from memory andloaded into registers. In blocks 308 and 310, the RF values aremultiplied by the sine and cosine values. The results are stored inregisters. The 32-bit register outputs are saturated to 16-bit values,and the four 16-bit values are stored in memory.

Mapping the operations/instructions of FIG. 3 to the functional units ofFIG. 2 produces the results shown in Table 1. Adding up the cycle-usageon the four types of processing units (L 210A/B, D 216A/B, S 212A/B, M214A/B), shows that the D units 216A/B would take the most cycles forthis algorithm, despite using the most efficient instructions available.Hence, this kernel can be considered to be “load-limited” and would needapproximately T cycles for processing T RF points, achieving a(best-case) performance of one pipelined kernel cycle per output point.Note that by virtue of the data flow chosen, and consequent use ofintrinsics, the S units 212A/B also need of one pipelined kernel cycleper output point.

TABLE 1 Mixing Operation Complexity Analysis C64x+ ™ Operation CyclesInstruction CPU Unit Loads - Input T/4 LDDW D1/D2 Loads - Sine/CosineT/2 LDW D1/D2 Multiplies T/2 MPY2IR M1/M2 Adds (Table indexing) T/4 ADDL1/L2/S1/S2 Saturate/Pack T/2 SPACK2 S1/S2 Modulus (Table indexing) T/2EXTU S1/S2 Stores T/4 STDW D1/D2

The mixing kernel described above, is typically followed by thedecimation kernel, which applies an anti-aliasing filter in addition toreducing the sampling rate by a factor of S. The decimation operationinvolves filtering across the T depth points for each scan-line. Atleast some embodiments, employ a polyphase implementation which isconceptually similar to clocking the mixed input samples into twofilters (one for the real and one for the imaginary parts) every cycleof the up-sampled clock, but clocking out a complex output only everyS^(th) clock cycle (every clock in the down-sampled domain). Thedecimation portion of the RF demodulator block takes T RF input samplesand outputs D (T/S) decimated samples. Thus, the filter (and itsassociated adds and multiplies) runs in the down-sampled domain.

FIG. 4 shows a processing flow diagram 400 for performing RFdemodulation decimation in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments. The flow comprises anouter loop 402 and an inner loop 404. The outer loop 402 runs only D,rather than T, iterations although the filtering operation may use all Tinput points. During each of the inner loop 404 iterations, four filtercoefficients 406 are read from memory and consumed. Similarly, fourcomplex RF values are read from memory (408, 410) and repacked (412,414). The repacked data is processed via dot product and summation. Notethe use of the intermediate 32-bit add (416) and the final 40-bit add(418). This distinction is notable because 32-bit adds can be performedon L 210A/B, S 212 A/B and D 216A/B units whereas 40-bit adds can takeplace only on L units 210A/B. Hence 32-bit adds should be used whereverpossible. In addition, minimizing use of the L units 210A/B, allows theL units 210A/B to be used for efficient packing instructions like DPACK2412, a parallel implementation of PACK2 and PACKH2 operations (packslower 16-bits of two words and upper 16-bits of two words), which onlyruns on the L units 210A/B.

Table 2 shows a mapping of the operations/instructions of FIG. 4 ontothe functional units of FIG. 2. The mapping shows that the decimationoperation would be multiplier-limited since the M units 214A/B wouldneed DL/2 cycles to complete their operations while the D 216A/B and S212A/B units would need fewer cycles to complete their operations.Again, by virtue of the data flow chosen, and therefore the usage ofintrinsics, the L units 210A/B also need of DL/2 cycles to completetheir operations. Note that the above analysis ignores the overhead forloop counter updates, pointer increments and filter circular indexing,but that such operations can be accommodated them on the unused D 216A/Band S 212A/B units.

TABLE 2 Decimation Complexity Analysis Operation Cycles C64x+ ™Instruction CPU Units Loads - Input DL/4 LDDW D1/D2 Loads - Filter DL/8LDDW D1/D2 Multiplies DL/2 DOTP2 M1/M2 Data manipulation DL/4 DPACK2L1/L2 Adds (40-bit) DL/4 ADD L1/L2 Adds (32-bit) DL/6 ADD S1/S2, D1/D2,L1/L2

For improved pipelining, another (identical) set of operations may beperformed inside the inner loop. To achieve this, embodiments unroll theouter loop (and process two depth points in parallel), instead offurther unrolling the inner loop (and processing eight filtercoefficients in parallel). This is because every time the inner loop isunrolled, the overhead (epilog/prolog) of the loop increases, and theoverhead is executed D times. In contrast, unrolling the outer loopresults in increasing its overhead (which is executed just once) andexecuting the inner loop overhead only D/2 times. Additionally, byunrolling the outer loop, L can be a multiple of four and D can be amultiple of two, rather than L being a multiple of eight. Using theseconstraints, at least some embodiments achieve pipelined kernelperformance of L/2 cycles per output point.

To provide for operation in systems with limited memory, embodimentsallow the filter memory to be set and returned, thereby allowing asoftware embodiment to be called multiple times to process partialsections of the scan-line. This increases the overhead slightly sinceseparate loops are used to process the first L samples at the start andthe state vector is updated before exiting the function.

Returning now to the Wall filter 114, the Wall filter 114 is a high-passfilter used to reduce high-amplitude, low-velocity echoes from vesselwalls, also called wall thump. Embodiments implement the Wall filter 114as an infinite impulse response (“IIR”) filter because the Wall filter'sresponse should be very sharp. However, since it needs to filter veryshort sequences (of ensemble size N), the transient performance of theWall filter 114 is important. To gain greater control over its transientperformance, embodiments employ a state-space formulation of the IIRfilter. This filter can be used with different types ofinitialization—the most common forms being step and projectioninitialization schemes. For additional information on the filterformulation see Steinar Bjaxum et al., Clutter Filter Design forUltrasound Color Flow Imaging, IEEE TRANSACTIONS ON ULTRASONICS,FERROELECTRICS, AND FREQUENCY CONTROL, February 2002, at 204, which isincorporated herein by reference in its entirety.

The basic operation performed for filtering each scan-line set is amultiplication of complex input data matrix, X, with real coefficientmatrix, W, given by,

Y _(D×N) =X _(D×N) ·W _(N×N),  (3)

Both the input and output data matrices include D rows (corresponding tothe D depth points) and N columns (corresponding to the N ensembles).Note that the matrix X is created from N decimated scan lines, E,forming the ensemble. The filter coefficient matrix W includes N rowsand N columns. Embodiments may pre-compute the matrix W.

FIG. 5 shows an exemplary processing flow diagram 500 for implementing aWall filter in a color Doppler imaging system using a VLIW processor inaccordance with various embodiments. The flow includes processing loops502, 504, and 506. In order to load filter coefficients efficiently,embodiments store the filter coefficient matrix W column-wise in memory(i.e., the transpose of the coefficient matrix W is stored). By storingthe transpose of the coefficient matrix, the coefficients on the samecolumn of W, are stored in adjacent memory addresses and can be 64-bitaligned and loaded using wide-load instructions for multiplying the rowsof the input matrix. The input and output data are preferably arrangedensemble-by-ensemble, i.e. all the N input/output points at a givendepth point, d, lie adjacent to each other in memory, before the Ninput/output points of the next depth point.

As shown in equation (3) and FIG. 5, the filtering operation comprisesmultiplication of a complex matrix with a real matrix. Complex samples(508) and filter coefficients (510) are read from memory. The samplesare repacked (512) and processed with the filter coefficients and priorfilter outputs to produce a complex filtered data sample. Someembodiments can employ multiple such kernels in parallel in order toimprove pipelining and to allow use of efficient shifting, packing andstorage instructions in the last stage.

Table 3 shows an exemplary mapping of the operations of FIG. 5 to thefunctional units of FIG. 2. The mapping of Table 3 shows that the Munits 214A/B are most heavily used as they are used for the two sets(real and imaginary) of 16×16 multiplies. To generate each of the DNcomplex output points, 2N real multiplies are needed for each point,taking DN²/2 cycles. Since all other functional units are expected totake fewer cycles than the M unit 214A/B, this embodiment of thealgorithm should need DN²/2 cycles for processing D decimated points.

TABLE 3 Wall Filtering Complexity Analysis Operation Cycles C64x+ ™Instruction CPU Units Loads - Input DN/4 LDDW D1/D2 Loads - CoefficientsN²/2 LDDW D1/D2 Multiplies DN²/2 DOTP2 M1/M2 Adds DN²/4 SADD L1/L2,S1/S2 Data Manipulation DN²/4 DPACK2 L1/L2

The pipelined kernel of at least some embodiments achieves performanceof N/2 cycles per output point, matching the estimates given above. Notethat algorithms involving nested loops need to be implemented carefullybecause only the innermost loop may be pipelined and the overheads ofthe inner loops are executed each time the outer loop runs. Typically,the best way to handle such situations is to unroll the innermost loopcompletely, and thereby reduce the number of loops. If the value of N isknown a-priori, this is the preferred approach. However, if it isdesired to keep N variable, all three loops 502, 504, 506 should beretained, but each loop may be partially unrolled. Depending on theextent to which each of the loops is unrolled, the cycle-performance ofthe function may differ, even if the innermost kernel cycles remain thesame in each case. This is because of the overheads of the loops changebased on the loop unrolling factors. For example, with D=128, N=8various implementations of the algorithms provided the following cyclecomplexities:

Theoretical complexity estimate was DN²/2=4096 cycles.

-   -   An embodiment retaining all three loops 502, 504, 506 and        unrolling each one twice has cycle-performance of 8.9 k cycles.        Note that in this embodiment, the real-imaginary bit packing was        done inside the innermost loop 506, since there were L/S units        available. Only the packing of the final result and storage were        done in the level 2 loop 504, thereby reducing the operations        that are not pipelined. The embodiment takes 0.5DN²+3.5DN+9.5D+8        cycles to process D points. Again, notice that the first term of        the equation matched the ideal estimate for the innermost        kernel.    -   By completely unrolling the innermost loop, such that the        implementation consists of only two nested loops, an embodiment        achieves cycle-performance of 5.7 k cycles.

Turning now to the flow estimators 116, a flow power estimator computesthe power of an ensemble of points. Flow power is also estimated by theflow parameter estimator discussed below. A separate implementation isconsidered here because flow power may be estimated in a variety ofcontexts. For example, a flow power estimator can be used to calculatethe input and output power of the Wall filter 114, to compute asignal-clutter ratio, or be used for detecting the presence of blood incolor Doppler processing—when other flow parameters are not used. For acomplex input matrix X_(D×N)=I_(D×N)+jQ_(D×N), the output vectorP_(D)={p_(d)} can be computed as,

$\begin{matrix}{{p_{d} = {{\sum\limits_{n = 0}^{N - 1}i_{d,n}^{2}} + q_{d,n}^{2}}},{0 \leq d \leq {\left( {D - 1} \right).}}} & (4)\end{matrix}$

Again, the input data may be arranged ensemble-by-ensemble.

FIG. 6 shows an exemplary processing flow diagram 600 for performingflow power estimation in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments. The flow diagram 600includes an inner loop 604 and an outer loop 602. Both the outer loopand the inner loop have been unrolled in parallel to facilitatescheduling without imposing large multiplicity constraints on N. Anotherreason embodiments limit the unrolling of the inner loop 604 is toreduce the overhead of the inner loop, which is executed each time theouter loop 602 runs.

Table 4 shows an exemplary mapping of the operations/instructions ofFIG. 6 to the functional units of FIG. 2. Table 4 shows that thisembodiment is multiply-limited and would need DN/2 cycles to process Ddecimated points translating to a pipelined performance of N12cycles/output point. Note that in this implementation the 40-bit adds(executed on L units 210A/B) take approximately the same number ofcycles as the multiplies. However, in some embodiments, some of these40-bit adds can be replaced by 32-bit adds which can take place on L210A/B, S 212 A/B, or D 216A/B units. Hence, although both the M 214A/Band the L 210A/B units take the same number of cycles, the cycles takenby the M unit 214A/B are less implementation-dependent and thereforedrive the theoretical estimates.

TABLE 4 Flow Power Computation Complexity Analysis Operation CyclesC64x+ ™ Instruction CPU Units Loads DN/4 LDDW D1/D2 Multiplies DN/2DOTP2 M1/M2 Adds DN/2 ADD L1/L2

In at least some embodiments, the inner kernel performance can bematched to the above estimates, i.e. the kernel uses N/2 cycles for eachoutput point. In such an embodiment, all the additions use 40-bitaccumulators and the additions are performed by the L units 210A/B. Suchan embodiment achieves performance of DN/2+4.5D+17 cycles for D points.

The flow estimators 116 can also include a flow parameter estimator(i.e., a color flow estimator). The flow parameter estimator uses theoutput Y of the Wall filter 114 to estimate the blood flow velocity,power and turbulence using autocorrelation based techniques. At depthpoint d, for computing both velocity, v_(d), and turbulence, t_(d),estimates, the flow parameter estimator uses the correlations betweenadjacent ensemble points, c_(d), and ensemble power, p_(d), as shownbelow,

$\begin{matrix}{{p_{d} = {\sum\limits_{n = 0}^{N - 2}{y_{d,n} \cdot y_{d,n}^{*}}}},} & \left( {5a} \right) \\{{c_{d} = {\sum\limits_{n = 0}^{N - 2}{y_{d,{n + 1}} \cdot y_{d,n}^{*}}}},} & \left( {5b} \right) \\{{v_{d} = {\tan^{- 1}\left\{ \frac{{Im}\left( c_{d} \right)}{{Re}\left( c_{d} \right)} \right\}}},} & \left( {5c} \right) \\{{t_{d} = {1 - {{saturate}\left\{ \frac{c_{d}}{p_{d}} \right\}}}},} & \left( {5d} \right)\end{matrix}$

The flow parameter estimator provides flow power, velocity andturbulence estimates. In at least some embodiments, the input,Y_(D×N)={y_(d,n)} includes packed 16-bit complex values while the power,velocity and turbulence outputs comprise 8-bit real data. For additionalinformation on flow parameter estimation see Chihiro Kasai et al.,Real-Time Two Dimensional Blood Flow Imaging Using an AutocorrelationTechnique, IEEE TRANSACTIONS ON SONICS AND ULTRASONICS, 1985, at 458,which is incorporated herein by reference in its entirety.

Some embodiments of the flow parameter estimator use a DSP math libraryto implement the four quadrant inverse tangent (_IQNatan2), division(_IQNdiv) and magnitude (_IQNmag) calculations. Each of these functionsuses 32-bit inputs and produces 32-bit outputs. Lower precision may besufficient for some embodiments. In at least some embodiments, thecorrelation outputs are 32-bits and all intermediate sums use 32-bitaccumulators.

The flow parameter estimator includes three kernels—one for power andcorrelation, a second for velocity and a third for turbulence estimates.FIG. 7 shows a processing flow diagram 700 for performing correlationand power computations in a color Doppler imaging system using a VLIWprocessor in accordance with various embodiments. An inner loop 704 andan outer loop 702 are shown. Packed complex sample values are loadedfrom memory (706). The complex correlation and real power values arecomputed using dot product and addition operations. Output values arepacked and stored.

Table 5 shows an exemplary mapping of the operations of FIG. 7 to thefunctional units of FIG. 2, to arrive at an estimate of the complexityof the kernel. Table 5 shows this kernel to be multiply limited, with aminimum of 3DN/2 cycles for processing D decimated points. Per equation(5c), velocity is computed taking the inverse tangent of the correlationat each depth point. In some embodiments, the inverse tangentcomputation takes ˜32 cycles per depth point using a pipelinedimplementation of _IQNatan2. The turbulence calculation involves using_IQNmag and _IQNdiv functions to perform a magnitude and division ateach depth point, taking ˜13 and ˜11 cycles respectively in pipelinedimplementations. Thus, the turbulence calculation is expected to take˜24 cycles per output depth point, yielding a total pipelinedperformance of 3DN/2+56D cycles for D points. From this complexityequation, it can be seen that for small values of N, embodiments canachieve significant complexity savings in the velocity and turbulenceestimation loops by using lower precision (and complexity) variants ofalgorithms used for computing magnitudes, divisions and inverses of thetangent.

In at least some embodiments, the outer loop 702 for thecorrelation-power computation kernel can be unrolled twice in order toachieve optimum performance. Such unrolling of the outer loop 702 allowsthe functional units to be balanced on each side, minimizes cross paths,and allows the use of efficient packing instructions to pack and storethe results for the two depth points at the end of the inner loop 704.Furthermore, some embodiments perform additional operations in thevelocity and turbulence computation kernels in order to produce 8-bitresults from the 32-bit _IQNatan2, and _IQNdiv outputs. Embodimentsinclude an additional saturate operation in the turbulence kernel inorder to ensure that the turbulence result always remains positive.Overall, at least some embodiments achieve performance of 3DN/2+62D+234cycles for D points, with the kernels matching the theoreticalprojections exactly.

TABLE 5 Correlation and Power Computation Complexity Analysis OperationCycles C64x+ ™ Instruction CPU Units Loads DN/4 LDDW D1/D2 Multipliescorrelations DN DOTP2, DOTPN2 M1/M2 Power DN/2 DOTP2 M1/M2 Addscorrelations DN ADD L1/L2 Power DN/2 ADD L1/L2 Data manipulation DN/4PACKLH2 L1/L2, S1/S2

At least some embodiments of the flow estimators 116 use two-dimensionalcomputations to produce the flow power, turbulence, and velocityestimates. For additional information regarding flow estimation seeChihito Kasai, referenced supra; Thanasis Loupas et al., ExperimentalEvaluation of Velocity and Power Estimation for Ultrasound Blood FlowImaging, by Means of a Two-Dimensional Autocorrelation Approach, IEEETRANSACTIONS ON ULTRASONICS, FERROELECTRICS, AND FREQUENCY CONTROL, July1995, at 689; Yi Zheng & James F. Greenleaf, Stable and Unbiased FlowTurbulence Estimation from Pulse Echo Ultrasound, IEEE TRANSACTIONS ONULTRASONICS, FERROELECTRICS, AND FREQUENCY CONTROL, September 1999, at1074 which are incorporated herein by reference in their entirety.Embodiments use echoes at M depth points (adjacent to a depth d) from Nscan lines (i.e., MN points) to achieve better performance thanembodiments using echo data at only depth d from N scan lines (Npoints). Embodiments use partial sums to reduce the complexity of thetwo-dimensional flow estimations.

Given the input complex data, Y_(D×N)={y_(d,n)}, the real flow poweroutputs, P_(D)={p_(d)}, are computed as,

$\begin{matrix}{{p_{d} = {\sum\limits_{m = 0}^{M - 1}{\sum\limits_{n = 0}^{N - 1}{y_{({{d + m},n})}y_{({{d + m},n})}^{*}}}}},} & (6)\end{matrix}$

The power computation for each input point involves 2 real multiples and1 add, and since a sum of MN such power estimates are needed for eachoutput point, a baseline (direct) implementation of equation (6) canrequire about, 2DMN real multiplies for each scan line and D(2MN−1) realadds for each scan line.

Embodiments reduce the computational load by storing a partialsum-of-powers at each depth point in a moving window of size M. Usingthe partial sum, embodiments obtain each output point by applying 2Nmultiplications (N for the real part and N for the imaginary part) and2N+1 additions, as opposed to 2MN multiplications and (2MN−1) additions.Thus for D points, embodiments use 2DN real multiplies for each scanline and D(2N−1) real adds for each scan line, shrinking the overallalgorithm to ≈1/M the complexity of the direct implementation with noperformance degradation. Additional memory is limited to an extrasliding window buffer to store the M partial sums.

FIG. 8 illustrates the direct implementation of equation (6). In FIG. 8,N=4, M=2, and D=8. The direct approach uses 2MN multiplications and(2MN−1) additions at each depth point, resulting in 2DMN multiplicationsand D(2MN−1) additions for each scan line. As shown in FIG. 8, the sumof powers of points (2, 1) to (2, 4) were used for both S1 and S2computations. Similarly the sum of powers of points (3, 1) to (3, 4)were used for both S2 and S3 computations and so on.

FIG. 9 illustrates the partial sums approach to flow power computationin accordance with various embodiments. A shown in FIG. 9, the sums ofpowers for each set of N points (P1-P4) are computed, stored, and usedto compute multiple flow power values (e.g., P1 is used compute S1 andS2). Thus, the partial-sums approach reduces the number of multiplies to2N and the number of adds to N+1 for every depth point d.

Embodiments use a similar approach to reduce the complexity of theautocorrelation sums used for velocity and turbulence computations. Boththese algorithms compute correlations along the ensemble (corr_n) anddepth (corr_m) directions as shown below,

$\begin{matrix}{{{corr\_ n}_{d} = {\sum\limits_{m = 0}^{M - 1}{\sum\limits_{n = 0}^{N - 2}{y_{{d + m},{n + 1}} \cdot y_{{d + m},n}^{*}}}}},} & (7) \\{{corr\_ m}_{d} = {\sum\limits_{m = 0}^{M - 2}{\sum\limits_{n = 0}^{N - 1}{y_{{d + m + 1},n} \cdot {y_{{d + m},n}^{*}.}}}}} & (8)\end{matrix}$

An embodiment using a direct implementation of equations (7) and (8)uses

≈4DM(N−1) real multiplies for each scan line (for ensemble correlation),

≈4D(M−1)N real multiplies for each scan line (for depth correlation),

≈2DM(N−1) real adds for each scan line (for ensemble correlation), and

≈2D(M−1)N real adds for each scan line (for depth correlation).

An embodiment using the more efficient partial sums implementationreduces complexity by a factor of M to

≈4D(N−1) real multiplies for each scan line (for ensemble correlation),

≈4DN real multiplies for each scan line (for depth correlation),

≈2D(N+1) real adds for each scan line (for ensemble correlation), and

≈2D(M+1) real adds for each scan line (for depth correlation).

Table 6 shows exemplary CPU cycles consumed by embodiments of thevarious modules of the color Doppler imaging system 108 discussed above.As shown, RF demodulation takes the bulk of the cycles. Together all themodules take ˜250 MHz of processing power, or ˜25% of a 1 GHz DSP. Table6 also shows that the innermost kernels take most of the cycles in casesof 1-level loops or simple 2-level loops where the inner loopepilog/prolog is small and minimal operations are done in the outerloop, e.g., the flow parameter estimator. In complicated 2 or 3-levelnested loops as in the decimation and wall filter kernels, the functiontakes significantly more cycles than those taken by the innermostkernels.

TABLE 6 Overall complexity of color flow algorithms Parameters FrameRate, F, = 30 frames/second Number of scan line (sets), B,: 64Decimation factor, S,: 8 Number of decimated depth points, D,: 128Number of ensembles, N,: 16 Filter taps, L, = 32 Innermost kernel cyclesTotal Cycles Module (MHz) (MHz) RF demodulation 94 190 Wall Filter 32 48Flow Parameter 20 22 Estimation Total 146 260

The above discussion is meant to be illustrative of the principles andvarious embodiments of the present invention. Numerous variations andmodifications will become apparent to those skilled in the art once theabove disclosure is fully appreciated. It is intended that the followingclaims be interpreted to embrace all such variations and modifications.

1. An ultrasound imaging system, comprising: color Doppler imagingcircuitry configured to estimate flow parameters, the imaging circuitrycomprising: a radio frequency (“RF”) demodulator configured to producein-phase and quadrature components of an ultra-sound data vector;wherein the radio frequency demodulator includes a table in memorystoring interleaved sine and cosine values, and the radio frequencydemodulator maintains an index value for the table having higherprecision than is used to index the table, and rounds the index valuefor each access of the table, each table access retrieving a sine valueand a cosine value.
 2. The ultrasound imaging system of claim 1, whereinthe radio frequency demodulator comprises a decimator configured todecimate the ultra-sound data vector, and an outer loop and an innerloop of the decimator are unrolled.
 3. The ultrasound imaging system ofclaim 2, wherein the decimator consumes at least four filtercoefficients and processes at least two depth points in parallel witheach iteration of the inner loop.
 4. The ultrasound imaging system ofclaim 1, wherein the color Doppler imaging circuitry further comprises aWall filter configured to reduce vessel wall echoes, the Wall filtercomprises a matrix of filter coefficients stored in memory, wherein theWall filter retrieves a plurality of filter coefficients per access andapplies the retrieved filter coefficients to a plurality of data valuesstored memory, wherein one of the filter coefficients and data values isstored column-wise and the other of the filter coefficients and datavalues is stored row-wise.
 5. The ultrasound imaging system of claim 1,wherein the color Doppler imaging circuitry comprises a flow estimatorthat computes a partial flow value and uses the partial flow value tocompute a plurality of flow parameter values.
 6. The ultrasound imagingsystem of claim 5, wherein the flow parameter values comprise at leastone of a flow parameter selected from a group consisting of flow power,flow velocity, and flow turbulence.
 7. The ultrasound imaging system ofclaim 1, wherein the RF demodulator mixes at least four RF points inparallel and outputs a complex output point per cycle.
 8. The ultrasoundimaging system of claim 1, wherein the color Doppler imaging circuitrycomprises a flow parameter estimator that computes flow power andcorrelation between adjacent ensemble points, wherein the flow parameterestimator comprises an inner loop and an outer loop and the outer loopis unrolled twice.
 9. The ultrasound imaging system of claim 1, whereinthe imaging circuitry comprises: a processor; and a memory coupled tothe processor; wherein the processor is configured to execute a colorDoppler software system stored in the memory, and the color Dopplersoftware system comprises instructions for performing RF demodulation,instructions for performing Wall filtering, and instructions forperforming flow estimation.
 10. The ultrasound imaging system of claim1, wherein the color Doppler imaging circuitry further comprises a Wallfilter configured to reduce vessel wall echoes and including nestingprocessing loops wherein the innermost loop is completely unrolled. 11.A method, comprising: fetching from ultrasound imaging system memory, ina single memory access cycle, a plurality of Wall filter coefficients;applying the plurality Wall filter coefficients to a plurality ofcomplex ultrasound data values in a single execution cycle; andproviding Wall filtered ultrasound data to a flow estimator.
 12. Themethod of claim 11, further comprising: fetching from a table stored inultrasound imaging system memory, in a single memory access cycle, asine value and a cosine value; rounding an index value maintained athigher precision than is used to access the table to a precision used toaccess the table; updating the index value at the higher precision; andgenerating a complex ultrasound radio frequency (“RF”) data value byapplying the sine and cosine value to an RF ultrasound data sample. 13.The method of claim 11, further comprising demodulating ultrasound RFdata using a nested processing loop wherein both inner and outer loopsof the nested processing loop are unrolled.
 14. The method of claim 11,further comprising decimating complex ultrasound RF data wherein thedecimation uses at least four anti-alias filter coefficients during eachiteration of an inner filter loop.
 15. The method of claim 11, furthercomprising computing two-dimensional flow parameter estimates by storinga set of partial sum of powers values and computing a flow parametervalue based on a stored partial sum of powers value.
 16. An ultrasoundcolor Doppler imaging system, comprising: a processor; and a softwaresystem for color Doppler imaging executed by the processor; wherein thesoftware system comprises instructions that when executed cause theprocessor to convert real radio frequency (“RF”) ultrasound samples tocomplex RF ultrasound samples using sine and cosine values stored inadjacent memory locations.
 17. The system of claim 16, wherein theprocessor is a very long instruction word (“VLIW”) processor.
 18. Thesystem of claim 17, wherein the VLIW processor comprises a plurality ofparallel data execution paths, each execution path comprising aplurality of execution units, each instruction of the software systembeing mapped to an execution unit.
 19. The system of claim 16, whereinthe software system comprises instructions that when executed cause theprocessor to decimate the complex RF ultrasound samples.
 20. The systemof claim 19, wherein the software system comprises instructions thatwhen executed cause the processor to compute flow parameters from thedecimated complex ultrasound samples, the computed flow parameterscomprising at least one of flow power, flow turbulence, and flowvelocity.
 21. The system of claim 19, wherein the software systemcomprises instructions that when executed cause the processor to filtervessel wall motion from the decimated complex ultrasound samples. 22.The system of claim 21, wherein the ultrasound Doppler imaging systemcomprises a transpose Wall filter coefficient matrix stored in memory,and the software system causes the processor to fetch a plurality ofWall filter coefficients in a single access.
 23. The system of claim 16,wherein the software system comprises a plurality of nested processingloops and both an outer loop and an inner loop of at least one of theplurality of nested processing loops are unrolled.
 24. A color Dopplerimaging system, comprising: circuitry that computes a sum of powers of afirst ensemble and a sum of power of a second ensemble; circuitry thatstores the sum of powers of the first ensemble and the sum of powers ofthe second ensemble in memory; circuitry that retrieves the sum ofpowers of the first ensemble and the sum of powers of the secondensemble from memory and computes a flow power parameter using theretrieved sums of powers.
 25. The color Doppler imaging system of claim24, further comprising: circuitry that computes a correlation sum of afirst ensemble and a correlation sum of a second ensemble; circuitrythat stores the correlation sum of the first ensemble and thecorrelation sum of the second ensemble in memory; circuitry thatretrieves the correlation sum of the first ensemble and the correlationsum of the second ensemble from memory and computes at least one of aflow velocity and a flow turbulence using the retrieved correlationsums.
 26. The color Doppler imaging system of claim 24, furthercomprising: circuitry that computes an correlation sum of depth pointsof a first set two different ensembles (CorrSum1) and a correlation sumof depth points of a second set of two different ensembles (CorrSum2);circuitry that stores the CorrSum1 and the CorrSum2 in memory; circuitrythat retrieves the CorrSum1 and the CorrSum2 from memory and computes atleast one of a flow velocity and a flow turbulence using the retrievedcorrelation sums.